Random Forests for Big Data
Authors
Abstract
Big Data is one of the major challenges of statistical science and has numerous consequences from algorithmic and theoretical viewpoints. Big Data always involves massive data, but it often also includes data streams and data heterogeneity. Recently, some statistical methods have been adapted to process Big Data, such as linear regression models, clustering methods and bootstrapping schemes. Based on decision trees combined with aggregation and bootstrap ideas, random forests were introduced by Breiman in 2001. They are a powerful nonparametric statistical method that handles regression problems, as well as two-class and multi-class classification problems, in a single, versatile framework. Focusing on classification problems, this paper reviews available proposals for random forests in parallel environments and for online random forests. We then formulate various remarks on random forests in the Big Data context. Finally, we experiment with three variants, involving subsampling, Big Data-bootstrap and MapReduce respectively, on two massive datasets (15 and 120 million observations): one simulated and one of real-world data.
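To make two of the variants named in the abstract more concrete, here is a minimal sketch of a subsampling forest and a MapReduce-style forest. It is an illustration under stated assumptions, not the paper's implementation: all function names (`fit_stump`, `fit_forest`, `subsample_forest`, `mapreduce_forest`) are hypothetical, the "map" step is a plain loop rather than a distributed job, and one-split decision stumps stand in for full decision trees to keep the example short.

```python
import random
from collections import Counter


def fit_stump(rows, labels, rng):
    """Fit a one-split tree (stump) on one randomly chosen feature.

    Simplification: a real random forest grows full trees, not stumps.
    """
    feature = rng.randrange(len(rows[0]))
    best = None
    for threshold in sorted({r[feature] for r in rows}):
        left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
        right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
        if not left or not right:
            continue
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    if best is None:
        # Feature is constant on this sample: always predict the majority class.
        majority = Counter(labels).most_common(1)[0][0]
        return (feature, float("inf"), majority, majority)
    return (feature, best[1], best[2], best[3])


def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label


def fit_forest(rows, labels, n_trees, rng):
    """Bagging: each stump is trained on a bootstrap sample of the given data."""
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], rng))
    return forest


def predict_forest(forest, row):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


def subsample_forest(rows, labels, fraction, n_trees, seed=0):
    """Subsampling variant: fit one forest on a random fraction of the data."""
    rng = random.Random(seed)
    k = max(1, int(fraction * len(rows)))
    idx = rng.sample(range(len(rows)), k)  # sample without replacement
    return fit_forest([rows[i] for i in idx], [labels[i] for i in idx],
                      n_trees, rng)


def mapreduce_forest(rows, labels, n_chunks, trees_per_chunk, seed=0):
    """MapReduce-style variant: one small forest per chunk, then merge them."""
    rng = random.Random(seed)
    order = list(range(len(rows)))
    rng.shuffle(order)  # random partition, so chunks resemble the full data
    forest = []
    for c in range(n_chunks):
        chunk = order[c::n_chunks]                      # "map": one chunk
        forest += fit_forest([rows[i] for i in chunk],  # "reduce": concatenate
                             [labels[i] for i in chunk],
                             trees_per_chunk, rng)
    return forest


if __name__ == "__main__":
    # Toy data (made up for illustration): the class is 1 when u > 0.5.
    data_rng = random.Random(1)
    us = [data_rng.random() for _ in range(1000)]
    rows = [(u, 1.0 - u) for u in us]
    labels = [int(u > 0.5) for u in us]
    forest = mapreduce_forest(rows, labels, n_chunks=5, trees_per_chunk=4)
    print(len(forest), predict_forest(forest, (0.95, 0.05)))
```

The random shuffle before chunking is a design choice in this sketch: if the chunks followed the storage order of the data, each sub-forest could see a biased slice of the data rather than something resembling the full sample.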
Similar resources
Exploratory Data Analysis using Random Forests
Although the rise of "big data" has made machine learning algorithms more visible and relevant for social scientists, they are still widely considered to be "black box" models that are not well suited for substantive research, only for prediction. We argue that this need not be the case, and present one method, Random Forests, with an emphasis on its practical application for exploratory analysis a...
Energy Efficient Data Mining Scheme for Big Data Biodiversity Environment
In this paper, we propose a novel energy efficient data mining scheme for big data biodiversity environment. Efficient machine learning and data mining techniques provide an unprecedented opportunity to monitor and characterize big data biodiversity environments, such as forest cover type, monitored using low cost wireless sensor networks. However, given the sheer amount of data collected by th...
Big data for microstructure-property relationships: a case study of predicting effective conductivities
The analysis of big data is changing industries, businesses and research since large amounts of data are available nowadays. In the area of microstructures, acquisition of (3D tomographic image) data is difficult and time-consuming. It is shown that large amounts of data representing the geometry of virtual, but realistic 3D microstructures can be generated using stochastic microstructure model...
Implementation of Random Forest Algorithm in Order to Use Big Data to Improve Real-Time Traffic Monitoring and Safety
Nowadays, active traffic management achieves better performance thanks to the real-time large data available in transportation systems. With the advancement of large data, actively and appropriately monitoring and improving traffic safety has become a necessity. Performance efficiency and traffic safety are considered important elements in measuring the pe...
Detection of independent associations in a large epidemiologic dataset: a comparison of random forests, boosted regression trees, conventional and penalized logistic regression for identifying independent factors associated with H1N1pdm influenza infections
BACKGROUND: Big data is steadily growing in epidemiology. We explored the performance of methods dedicated to big data analysis for detecting independent associations between exposures and a health outcome. METHODS: We searched for associations between 303 covariates and influenza infection in 498 subjects (14% infected) sampled from a dedicated cohort. Independent associations were detected u...
Journal title:
- Big Data Research
Volume 9, issue
Pages -
Publication date 2017